In [None]:
# if you need to install the statsmodels library
!pip install --user statsmodels

In [None]:
# if you need to install scikit-learn
!pip install --user scikit-learn

# Lab 6 - Review of linear regression

Two variables have a linear relationship if their scatterplot looks roughly like a line.  *Linear regression* is a method for modelling how a *dependent variable* depends linearly on one or more *independent variables*.  The dependent variable (also called a *response variable*, among other names) is what we are trying to model or predict, and is usually denoted $Y$.  The independent variables (also called *explanatory* or *input variables*) are the information we use to make the prediction, and are usually denoted $X_1, X_2, ...$.

The linear relationship is: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \epsilon$$
where $\epsilon$ represents the error.

Here, $Y, X_1, X_2, ..., X_n$ are *random variables*: mathematical variables that can take different values with different probabilities.

Linear regression finds the coefficients $\beta_0, \beta_1, ..., \beta_n$ that minimize the sum of the squared errors over all observations in the data, that is, that make it as small as possible.
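To make the idea of minimizing the squared error concrete, here is a minimal sketch using NumPy on made-up data. The data, the true coefficients (2 and 3), and the helper `sse` are assumptions chosen for illustration; they are not part of the lab's dataset.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): y = 2 + 3*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# Least squares: choose beta to minimize sum((y - X @ beta)**2)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(b):
    """Sum of squared errors for a candidate coefficient vector b."""
    return float(np.sum((y - X @ b) ** 2))

print("estimated beta_0, beta_1:", beta)
print("SSE at the least-squares solution:", sse(beta))
print("SSE after nudging the slope:", sse(beta + np.array([0.0, 0.1])))
```

Any change to the fitted coefficients, such as the small nudge to the slope in the last line, should give a larger sum of squared errors.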

In this lab, we will look at a classic dataset used for studying regression, containing data about Boston housing prices drawn from the 1970 U.S. census.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns
import numpy as np

%matplotlib inline

Load the dataset from the scikit-learn package.

In [None]:
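# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs a scikit-learn version older than 1.2
from sklearn.datasets import load_boston
boston_dict = load_boston()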

Display the variable `boston_dict`:

In [None]:
boston_dict

`boston_dict` is not a dataframe!  It is something called a *dictionary* that contains the information in separate properties, called *keys*.  We can get the list of keys:

In [None]:
boston_dict.keys()

To see what is linked to a key:

In [None]:
boston_dict.feature_names

To see the description of the data set:

In [None]:
print(boston_dict.DESCR)

What happens if you leave off print?

To convert the data into a Pandas dataframe:

In [None]:
boston = pd.DataFrame(boston_dict.data)

Display the dataframe `boston`:

There are no column names, so let's repeat that command, but telling Pandas that the column names are in `feature_names`:

In [None]:
boston = pd.DataFrame(boston_dict.data, columns=boston_dict.feature_names)

Display the new dataframe. 

Each column is an independent variable.  (The columns were described when we did `print(boston_dict.DESCR)`.)  The dependent variable that we want to predict is linked to the key called `target`, so we will add it to our dataframe as a column.

In [None]:
boston["price"] = boston_dict.target 

Check the column was added correctly.
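As a rough sketch of the whole conversion and one way to check it, the cell below repeats the same steps on a tiny made-up dictionary. The name `toy_dict` and its values are hypothetical stand-ins, not the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset dictionary with the same kinds of keys
toy_dict = {
    "data": np.array([[0.1, 6.5], [0.3, 5.9], [0.2, 7.1]]),
    "feature_names": ["CRIM", "RM"],
    "target": np.array([24.0, 21.6, 34.7]),
}

# Build the dataframe from the data, naming the columns from feature_names
toy = pd.DataFrame(toy_dict["data"], columns=toy_dict["feature_names"])

# Add the dependent variable as a new column
toy["price"] = toy_dict["target"]

# Two quick checks: look at the first rows and at the column names
print(toy.head())
print(toy.columns.tolist())
```

If the new column was added correctly, it appears last in the list of column names and the number of rows is unchanged.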

Let's look at how a few of the variables relate to the median housing price.

Let's see how the per capita crime rate by town (column `CRIM`) relates to the median housing price (column `price`, which is in $1000s) by making a scatter plot of the two variables.

<details> <summary>Hint:</summary>
    <code>df.plot.scatter(x = "x_column_name", y = "y_column_name")</code>
</details>

Make a scatter plot of the relationship between the average number of rooms per dwelling (column `RM`) and the price.

Finally, plot the relationship between the pupil-teacher ratio by town (column `PTRATIO`) and the price.

Of the three variables, `CRIM`, `RM`, and `PTRATIO`, which seems to have the most linear relationship with `price`?  Which of the relationships are positive (have a positive slope) and which are negative?

Let's perform linear regression using `RM` (average number of rooms) as the independent variable.  We want to predict `price`, our dependent variable.

In [None]:
lm = smf.ols('price ~ RM', data=boston).fit()
lm.summary()

What's the equation of our linear model?

We can visualize the line using Seaborn:

In [None]:
sns.regplot(y="price", x="RM", data=boston, fit_reg=True)

The *residuals* are the differences between the actual value of price and the value predicted by the regression line, one for each row of the data.
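The definition of a residual can be checked by hand. The sketch below fits a line to made-up data with `np.polyfit` (used here in place of statsmodels only to keep the example small) and computes actual minus predicted for every row. The data and variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (hypothetical): a noisy line
rng = np.random.default_rng(1)
x = rng.uniform(4, 9, size=100)
y = -30.0 + 9.0 * x + rng.normal(0, 5, size=100)

# Fit a straight line by least squares; polyfit returns the slope first
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted (fitted) value for each row, then residual = actual - predicted
fitted = intercept + slope * x
resid = y - fitted

print("mean residual:", resid.mean())
print("largest absolute residual:", np.abs(resid).max())
```

For a least-squares line with an intercept, the residuals average out to zero (up to rounding), so positive and negative errors balance.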

In [None]:
lm.resid

Plot a histogram of the residuals.  If the linear model is a good fit, the histogram should look like a normal distribution.  Does it?

To better understand the relationship between the actual prices and the predicted or *fitted* values, we can plot a scatterplot of the two.  To get the fitted values, type `lm.fittedvalues`.  Try it below.

Let's make a scatter plot with the price as the x variable and the fitted values as the y variable.  The code is slightly different since the fitted values aren't part of the `boston` dataframe.

In [None]:
plt.scatter(boston['price'], lm.fittedvalues)
plt.xlabel("Prices ($1000s)")
plt.ylabel("Predicted prices ($1000s)")
plt.title("Prices vs Predicted Prices")

If all the prices were predicted correctly, what would this plot have looked like?  By noticing where this plot is least like a plot with perfectly predicted prices, we can see where our linear model fails (makes bad predictions).  Where do you notice errors in the prediction?

The most expensive houses are predicted incorrectly.  This is known as a *ceiling effect*, where the independent variable no longer has an effect on the dependent variable.

Find the linear model showing how price depends on `PTRATIO`, the pupil-teacher ratio by town:

Check the fit of your model by looking at the histogram of the residuals and the scatter plot of the predicted prices vs. the actual prices.

Do you think this linear model is a good predictor of prices?

Let's try making a linear regression model with three independent variables:

- `CRIM` (per capita crime rate by town)
- `RM` (average number of rooms per dwelling)
- `PTRATIO` (pupil-teacher ratio by town)

The code will be similar to above, but the formula is `price ~ CRIM + RM + PTRATIO`.
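To show the shape of a formula with several independent variables without giving away the exercise, here is a minimal sketch on made-up data. The dataframe `toy`, its columns `x1`, `x2`, `x3`, `y`, and the true coefficients are all hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): y depends linearly on three predictors
rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
toy["y"] = (1.0 - 0.5 * toy["x1"] + 2.0 * toy["x2"] + 0.8 * toy["x3"]
            + rng.normal(0, 0.5, size=n))

# Independent variables are joined with + on the right of the ~
toy_lm = smf.ols("y ~ x1 + x2 + x3", data=toy).fit()

# One coefficient per predictor, plus the intercept
print(toy_lm.params)
print("R-squared:", toy_lm.rsquared)
```

The fitted model exposes `resid` and `fittedvalues` exactly as in the single-variable case, so the same residual histogram and predicted-versus-actual plots apply.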

Check the fit of your model by plotting the histogram of the residuals and the scatter plot of the predicted prices vs. the actual prices.

How did this linear model perform?